Moving from a demo to a production-ready AI system is not about better prompt engineering anymore. It is about rigorous agentic engineering, schema discipline, observability, and a set of habits that are rarely discussed. Here is what actually matters in 2026.
1. Reserve the LLM for Reasoning, Use Deterministic Code for Execution
The single most important architectural shift in 2026: let the LLM do reasoning and intent extraction, then hand its output to deterministic code for execution. Do not let an LLM calculate, query a database, or apply business rules. Extract the intent with the LLM, validate it with a schema, then run a normal Python function or SQL query.
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class OrderDecision(BaseModel):
    action: Literal["approve", "reject", "escalate"]
    reason: str
    confidence: float

# order_data, order_id, db, send_rejection_email, and create_support_ticket
# are assumed to be defined elsewhere in your application.

# Step 1: LLM extracts structured intent
response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[{"role": "user", "content": f"Evaluate this order: {order_data}"}],
    response_format=OrderDecision,
)
decision = response.choices[0].message.parsed

# Step 2: Deterministic code executes the decision
if decision.action == "approve":
    db.execute("UPDATE orders SET status='approved' WHERE id=?", [order_id])
elif decision.action == "reject":
    send_rejection_email(order_id, decision.reason)
elif decision.action == "escalate":
    create_support_ticket(order_id, decision.reason)
# The LLM never touches the database directly
This keeps the LLM's non-determinism contained to the reasoning step. Once you have a validated typed object, execution is fully deterministic and auditable.
2. Start With Evals, Not Prompts
Write your evaluation before writing your prompt. An eval is a test: given this input, what output do I consider correct? Without evals, you iterate by intuition. You change a prompt, run it once, it looks better, and you ship it. You have no idea if you fixed one case and broke three others.
from your_agent import classify_intent

TEST_CASES = [
    {"input": "book a flight to Mumbai", "expected": "travel"},
    {"input": "what's the weather today", "expected": "weather"},
    {"input": "remind me to call mom", "expected": "reminder"},
    {"input": "show me my account balance", "expected": "finance"},
]

def run_evals():
    passed = 0
    for case in TEST_CASES:
        result = classify_intent(case["input"])
        if result == case["expected"]:
            passed += 1
        else:
            print(f"FAIL: '{case['input']}' got '{result}', expected '{case['expected']}'")
    print(f"\n{passed}/{len(TEST_CASES)} passed")
    return passed == len(TEST_CASES)

if __name__ == "__main__":
    assert run_evals(), "Evals failed: do not ship this prompt"
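The "fixed one case and broke three others" problem becomes visible once you compare per-case results between two prompt versions instead of looking only at the pass count. Here is a minimal sketch, assuming each eval run is stored as a dict mapping test input to pass/fail. The `diff_runs` helper and the sample results are hypothetical and not part of the eval script above.

```python
def diff_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-case pass/fail results from two eval runs.

    Cases missing from the baseline are treated as previously failing.
    """
    fixed = [k for k in candidate if candidate[k] and not baseline.get(k, False)]
    broken = [k for k in candidate if not candidate[k] and baseline.get(k, False)]
    return {"fixed": sorted(fixed), "broken": sorted(broken)}


# Hypothetical results for an old and a new prompt on the same test cases.
old_prompt = {
    "book a flight to Mumbai": True,
    "remind me to call mom": True,
    "what's the weather today": False,
}
new_prompt = {
    "book a flight to Mumbai": True,
    "remind me to call mom": False,
    "what's the weather today": True,
}

report = diff_runs(old_prompt, new_prompt)
```

Both runs score 2/3, so the pass count alone would hide the regression. The diff shows it.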
3. Build for Architectural Impermanence
The complex agent harness you write today might be replaced in months by a new model that does it natively. This is not hypothetical — it has already happened repeatedly. Build modularly. Each agent is a function. Each capability is a module. When a new model makes one module obsolete, you deprecate that module and replace it without touching the rest of the system.
The practical rule: if you find yourself building deeply coupled orchestration logic that only makes sense if the current model limitations persist, stop and redesign. Build for capability boundaries that are likely to move.
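One way to keep modules replaceable is to have callers depend on a capability name rather than on a concrete implementation. The sketch below assumes a simple in-process registry. `register`, `run`, and both summarizer functions are hypothetical names used only for illustration.

```python
from typing import Callable, Dict

# Registry mapping a capability name to whatever currently implements it.
Capability = Callable[[str], str]
_registry: Dict[str, Capability] = {}


def register(name: str, impl: Capability) -> None:
    """Install or replace the implementation behind a capability name."""
    _registry[name] = impl


def run(name: str, payload: str) -> str:
    """Callers depend on the capability name only, never on the implementation."""
    return _registry[name](payload)


def summarize_with_chunking(text: str) -> str:
    # Hypothetical harness: stands in for a multi-step workaround for a model limit.
    chunks = [text[i:i + 20] for i in range(0, len(text), 20)]
    return " | ".join(chunk.strip() for chunk in chunks)


def summarize_natively(text: str) -> str:
    # Hypothetical replacement: stands in for a single call to a newer model.
    return text.strip()


register("summarize", summarize_with_chunking)
before = run("summarize", "a long document that needed chunking")

register("summarize", summarize_natively)  # swap one module, callers unchanged
after = run("summarize", "a long document that needed chunking")
```

The code that calls `run("summarize", ...)` never changes when the harness is retired, which is the property the rule above is asking for.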
4. OpenTelemetry from Day One
In 2026, OpenTelemetry is table stakes for production AI systems. Major observability platforms, including Datadog, New Relic, and LangSmith, accept OTel traces that follow the GenAI semantic conventions. Instrument against OTel from the start rather than building proprietary telemetry you will have to replace later.
The model: every user request is a trace. Every agent call, tool call, retrieval, and handoff is a span. Tag each span with provider, model, token counts, and latency. This gives you the full execution path of any run from a single trace ID.
import anthropic
from opentelemetry import trace
# SpanAttributes comes from the opentelemetry-semantic-conventions-ai package.
# Depending on the installed version, the module path may be
# opentelemetry.semconv_ai instead.
from opentelemetry.semconv.ai import SpanAttributes

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
tracer = trace.get_tracer("my-agent")

def llm_call_with_tracing(model: str, messages: list, agent_name: str):
    with tracer.start_as_current_span(f"llm.{agent_name}") as span:
        span.set_attribute(SpanAttributes.LLM_SYSTEM, "anthropic")
        span.set_attribute(SpanAttributes.LLM_REQUEST_MODEL, model)
        span.set_attribute("agent.name", agent_name)
        response = client.messages.create(model=model, messages=messages, max_tokens=4096)
        span.set_attribute(SpanAttributes.LLM_USAGE_PROMPT_TOKENS,
                           response.usage.input_tokens)
        span.set_attribute(SpanAttributes.LLM_USAGE_COMPLETION_TOKENS,
                           response.usage.output_tokens)
        return response
5. Async Everything, From Day One
LLM API calls are slow — a single Opus 4.8 call can take 10-30 seconds. If your pipeline runs 6 agents sequentially and each takes 20 seconds, you have a 2-minute pipeline. Design for async from the first line. Even if v1 runs sequentially, use async functions throughout. Adding parallelism later is trivial. Retrofitting sync code to async is a major refactor.
import asyncio

async def run_pipeline(goal: str):
    # Phase 1: parallel (no dependency between these)
    research, context = await asyncio.gather(
        researcher.run(goal),
        context_agent.run(goal),
    )
    # Phase 2: sequential (coder depends on both)
    code = await coder.run(goal, research=research, context=context)
    # Phase 3: parallel (three audit passes on the same code)
    audit_results = await asyncio.gather(
        auditor.run(code, pass_num=1),
        auditor.run(code, pass_num=2),
        auditor.run(code, pass_num=3),
    )
    return await documenter.run(code, audits=audit_results)
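To see the latency difference without calling a real API, you can stand in for each agent with `asyncio.sleep`. This is a sketch under that assumption. `fake_agent` is hypothetical, and the 0.1-second delay is arbitrary.

```python
import asyncio
import time


async def fake_agent(name: str, seconds: float) -> str:
    """Stand-in for an LLM-backed agent; sleeps instead of calling an API."""
    await asyncio.sleep(seconds)
    return name


async def sequential(delay: float) -> float:
    """Run three agents one after another and return the elapsed time."""
    start = time.perf_counter()
    for name in ("a", "b", "c"):
        await fake_agent(name, delay)
    return time.perf_counter() - start


async def parallel(delay: float) -> float:
    """Run the same three agents concurrently and return the elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_agent(n, delay) for n in ("a", "b", "c")))
    return time.perf_counter() - start


seq_time = asyncio.run(sequential(0.1))
par_time = asyncio.run(parallel(0.1))
```

The sequential run takes roughly three times the single-agent delay, while the parallel run takes roughly one. The same ratio applies when the delay is 20 seconds instead of 0.1.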
6. Log Every Token, From the Start
LLM costs are invisible until they are catastrophic. A pipeline costing $0.50 per run at 10 runs/day is $150/month. At 100 runs/day it is $1,500/month. Log input tokens, output tokens, provider, model, and latency for every LLM call. Aggregate by agent, model, and day. You will find that one agent consumes 70% of your budget — it is usually one you did not expect.
async function callWithLogging(provider, messages, model) {
  const start = Date.now();
  const response = await provider.call(messages, model);
  const latency = Date.now() - start;
  db.prepare(`
    INSERT INTO llm_usage (provider, model, input_tokens, output_tokens, latency_ms, ts)
    VALUES (?, ?, ?, ?, ?, unixepoch())
  `).run(
    provider.name,
    model,
    response.usage.input_tokens,
    response.usage.output_tokens,
    latency
  );
  return response;
}
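Once usage rows exist, the aggregation step is a short query. The sketch below uses Python's built-in `sqlite3` with an in-memory table. It assumes an `agent` column and a `day` column that the `llm_usage` insert above does not have, and the model names and per-million-token prices are hypothetical placeholders, not real rates.

```python
import sqlite3

# Hypothetical per-million-token prices in USD as (input, output);
# substitute your provider's real rates.
PRICES = {"big-model": (15.0, 75.0), "small-model": (1.0, 5.0)}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_usage (agent TEXT, model TEXT, input_tokens INTEGER, "
    "output_tokens INTEGER, day TEXT)"
)
conn.executemany(
    "INSERT INTO llm_usage VALUES (?, ?, ?, ?, ?)",
    [
        ("researcher", "small-model", 200_000, 50_000, "2026-01-01"),
        ("auditor", "big-model", 100_000, 20_000, "2026-01-01"),
        ("auditor", "big-model", 100_000, 20_000, "2026-01-01"),
    ],
)


def cost_by_agent(conn: sqlite3.Connection) -> dict[str, float]:
    """Sum token counts per (agent, model) in SQL, then price them in Python."""
    rows = conn.execute(
        "SELECT agent, model, SUM(input_tokens), SUM(output_tokens) "
        "FROM llm_usage GROUP BY agent, model"
    ).fetchall()
    totals: dict[str, float] = {}
    for agent, model, tokens_in, tokens_out in rows:
        price_in, price_out = PRICES[model]
        cost = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
        totals[agent] = totals.get(agent, 0.0) + cost
    return totals


costs = cost_by_agent(conn)
```

Adding `day` to the `GROUP BY` clause gives the per-day breakdown. With the sample rows, the auditor accounts for most of the spend even though the researcher processes more tokens.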
7. RAG Before Raw LLM: Retrieval-First Design
Grounding your agents in retrieved context before asking them to reason cuts hallucinations significantly. The pattern: retrieval first (search your knowledge base, fetch the relevant docs, pull the current state), then pass the retrieved context to the LLM with citations. The LLM reasons over real data, not memory.
Require citation: tell the agent to include the source ID for every claim. This makes hallucinations detectable — if the agent cites a source that does not support the claim, you catch it downstream. Without citation, hallucinations are invisible.
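The cheapest downstream checks are structural: does every sentence carry a citation, and does every cited ID refer to a document that was actually retrieved? Here is a sketch of those two checks, assuming the agent is told to cite as `[source-id]`. The citation format, the sample sources, and both helper names are assumptions. Checking whether a cited source really supports the claim needs a further step that this sketch does not cover.

```python
import re

# Hypothetical retrieved documents keyed by source ID.
SOURCES = {
    "doc-1": "Refunds are processed within 5 business days.",
    "doc-2": "Orders above 500 USD require manager approval.",
}

# Assumed citation format: a source ID in square brackets, e.g. [doc-1].
CITATION = re.compile(r"\[([\w-]+)\]")


def unknown_citations(answer: str, sources: dict[str, str]) -> list[str]:
    """Return cited source IDs that were never retrieved."""
    return [sid for sid in CITATION.findall(answer) if sid not in sources]


def uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no [source-id] marker at all."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


answer = (
    "Refunds take 5 business days [doc-1]. "
    "Shipping is free worldwide [doc-9]. "
    "Gift cards never expire."
)

bad_ids = unknown_citations(answer, SOURCES)
missing = uncited_sentences(answer)
```

Here the second sentence cites a source that was never retrieved and the third cites nothing, so both get flagged before the answer moves on.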
8. Version Your Prompts Like Code
Prompts are code. They have versions. Changes to them should be tracked, reviewed, and easy to roll back. Minimum viable setup: store each prompt in a file with a version identifier, commit it to git, and log which prompt version produced each output.
prompts/
  auditor/
    v1.0.txt    # original auditor prompt
    v1.1.txt    # added security checklist
    v2.0.txt    # restructured for three-pass
    current -> v2.0.txt

# Log which version ran
{
  "run_id": "run_abc123",
  "agent": "auditor",
  "prompt_version": "v2.0",
  "model": "claude-opus-4-8-20260528",
  "output_hash": "sha256:..."
}
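Loading a prompt by version and producing a log record like the one above takes only a few lines. This is a sketch that assumes the `prompts/<agent>/<version>.txt` layout shown. `load_prompt`, `run_record`, and the sample prompt text are hypothetical, and a temporary directory stands in for the real `prompts/` folder.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def load_prompt(root: Path, agent: str, version: str) -> str:
    """Read <root>/<agent>/<version>.txt from disk."""
    return (root / agent / f"{version}.txt").read_text(encoding="utf-8")


def run_record(run_id: str, agent: str, version: str, model: str, output: str) -> str:
    """Build the JSON log line tying an output to the prompt version that produced it."""
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
    return json.dumps({
        "run_id": run_id,
        "agent": agent,
        "prompt_version": version,
        "model": model,
        "output_hash": f"sha256:{digest}",
    })


# Demo against a throwaway directory standing in for prompts/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auditor").mkdir()
    (root / "auditor" / "v2.0.txt").write_text(
        "Audit the code in three passes.", encoding="utf-8"
    )
    prompt = load_prompt(root, "auditor", "v2.0")

record = json.loads(run_record("run_abc123", "auditor", "v2.0", "some-model", "LGTM"))
```

Hashing the output rather than storing it keeps the log small while still letting you check whether two runs produced identical text.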
9. Local LLM as Fallback, Always
Every production AI system should have a local LLM fallback. Ollama makes this trivial. The quality is lower than Opus 4.8 or GPT-5.5, but for classification and routing tasks it is good enough. And it is always available — which matters most when you demo live and the API is degraded.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:7b # code tasks
ollama pull llama3.2:3b # fast general purpose
ollama pull deepseek-r1:8b # reasoning, slightly larger
# OpenAI-compatible — zero code changes needed
# base_url = "http://localhost:11434/v1"
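The fallback logic itself is independent of any SDK: try providers in order and return the first success. The sketch below uses stub functions in place of real clients, so `hosted_api` and `local_ollama` are hypothetical stand-ins. In a real system the second one would be a client pointed at the local `base_url` shown above.

```python
from typing import Callable, Sequence

Provider = Callable[[str], str]


def call_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in order; return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def hosted_api(prompt: str) -> str:
    # Hypothetical stand-in for a hosted API call that is currently degraded.
    raise TimeoutError("upstream timed out")


def local_ollama(prompt: str) -> str:
    # Hypothetical stand-in for a request to the local Ollama server.
    return "travel"


used, answer = call_with_fallback(
    "classify: book a flight to Mumbai",
    [("hosted", hosted_api), ("local", local_ollama)],
)
```

Returning the provider name alongside the response lets you log how often the fallback actually fires.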
10. The Anti-Patterns That Will Burn You
- Trusting LLM output without validation. Parse and validate the schema before the next agent runs. Never pass raw LLM output to the next stage unchecked.
- Temperature above 0 for structured output. A higher sampling temperature makes malformed or schema-violating JSON more likely. Your parser will fail intermittently, and it will look like a parser bug.
- Sharing one conversation history across agents. This pollutes the context window: Agent 5 should not see Agent 1's 15,000-token conversation. Use a blackboard (a shared store that agents read from and write to) and pass structured artifacts.
- No rate limit tracking. You will get 429 errors in bursts. Use a sliding window counter and throttle yourself before the provider does.
- Deeply coupled agent harnesses. When the next model version makes your orchestration logic obsolete, you need to be able to replace a module, not rewrite the system.
- Hardcoded system prompts in source code. The first time you need to update a prompt without redeploying the server, you will regret this.
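The sliding window counter mentioned in the rate limit item can be sketched with a deque of timestamps. This is a minimal single-process version, not a drop-in for any provider's actual limits. The class name and the injectable `clock` parameter are choices made for this example.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any rolling `window_s` seconds."""

    def __init__(self, max_calls: int, window_s: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock    # injectable so the limiter can be tested without sleeping
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the window, else return False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_s:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


# Demo with a fake clock: 2 calls allowed per 60 seconds.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_calls=2, window_s=60.0, clock=lambda: fake_now[0])
results = [limiter.try_acquire() for _ in range(3)]  # third call exceeds the limit
fake_now[0] = 61.0
after_window = limiter.try_acquire()  # earlier calls have aged out
```

When `try_acquire` returns False, the caller can wait or queue the request instead of sending it and getting a 429 back.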
11. The Right Stack for 2026
-- Local single-agent prototype --
Python + Ollama + SQLite + Jupyter
-- Production single-tenant agent pipeline --
Node.js (ESM) or Python asyncio
SQLite (state) + Redis (queues if needed)
Multi-provider LLM router + Ollama fallback
SHA-256 response cache
Circuit breakers per provider
SSE streaming for progress
OpenTelemetry for tracing
-- Multi-tenant hosted platform --
FastAPI + Postgres + Redis
JWT auth + row-level security
E2B sandboxing per run
Stripe/Razorpay billing
LangGraph or CrewAI for orchestration
Kubernetes or Fly.io for isolation
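Of the components listed, the SHA-256 response cache is the simplest to illustrate. The sketch below hashes everything that influences the response into a key and stores results in a dict. `ResponseCache`, `cache_key`, and `fake_llm` are hypothetical, and a production version would use a persistent store such as SQLite or Redis in place of the dict. Caching like this only makes sense when identical inputs should give identical outputs, for example at temperature 0.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float = 0.0) -> str:
    """Hash everything that influences the response into a stable key."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache keyed by the SHA-256 of the request."""

    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_call(self, model, messages, call):
        """Return the cached response, or invoke `call` and store its result."""
        key = cache_key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self._store[key] = call(model, messages)
        return self._store[key]


calls = []


def fake_llm(model, messages):
    # Hypothetical stand-in for a real provider call.
    calls.append(model)
    return "cached answer"


cache = ResponseCache()
msgs = [{"role": "user", "content": "hello"}]
first = cache.get_or_call("some-model", msgs, fake_llm)
second = cache.get_or_call("some-model", msgs, fake_llm)
```

The second request is served from the cache, so the provider is called only once.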